EXPLORING TEACHER EFFECTIVENESS
USING HIERARCHICAL LINEAR MODELS:
STUDENT- AND CLASSROOM-LEVEL PREDICTORS
AND CROSS-YEAR STABILITY
IN ELEMENTARY SCHOOL READING


Evaluating teacher effectiveness with student growth measures is a popular
reform strategy in education. Teachers can make a difference in student
academic growth, but how to measure this impact remains an open question.
This study examines models of teacher effectiveness and the development of
hierarchical linear models (HLM) using fourth grade end-of-year state
accountability reading test scores as the outcome variable. An extensive
review of literature was conducted to assess the use of HLM in educational
settings, particularly as related to teacher effectiveness analyses. Although
multiple student variables were explored, the previous reading test score was
the most significant student-level variable, while teachers’ years of
experience was used as a classroom-level variable. This model produced a
classroom effectiveness index that was notably consistent across three years
of data for the same teachers. Implications for policy, practice, and
research are discussed.


Teacher effectiveness is an important area of investigation that has
emerged in recent years among educational researchers. A growing body of
research has shown that teacher effectiveness is a strong predictor of
student achievement (Darling-Hammond, 1996; Darling-Hammond, 2000; Hanushek
& Lindseth, 2009; Munoz & Chang, 2007; Nye, Konstantopoulos, & Hedges, 2004;
Sanders & Rivers, 1996; Stronge, Ward, Tucker, & Hindman, 2008). Yet teachers
receive little formative or summative feedback on their teaching activities;
in general, teacher evaluation is a compliance exercise based on
non-informative checklists that commonly ends in a rating of “satisfactory”
(Weisberg, Sexton, Mulhern, & Keeling, 2009). Furthermore, research indicates
that schools serving minority, low-income students have the most difficulty
recruiting and retaining effective teachers (Darling-Hammond, 2000). This
disparity in teacher effectiveness between schools and districts contributes
to the student achievement gap.

In light of these findings, relatively recent educational policies such as
No Child Left Behind and Race to the Top have attempted to address teacher
effectiveness at the policy level. The 2001 reauthorization of Title I of the
Elementary and Secondary Education Act, commonly referred to as No Child Left
Behind (NCLB), specified that teachers must be “highly qualified” to ensure
that all students learn and demonstrate academic proficiency. To be
considered “highly qualified,” a teacher of a core academic subject is
required to hold a bachelor’s degree, have full state certification or
licensure, and demonstrate subject matter competence (measured using PRAXIS
tests). NCLB’s approach to determining teacher quality thus rests on teacher
preparation: the qualifications or inputs from teachers’ training.

This more input-oriented concept of teacher effectiveness was expanded at
the national level with the American Recovery and Reinvestment Act of 2009 to
focus on teacher effectiveness in terms of outcomes, or what teachers are
able to do to improve student achievement. The principle upon which Race to
the Top (RTTT) is founded calls for teacher effectiveness to be determined
from a combination of measures using both students’ growth indicators and
observation-based assessments. Student growth is based on the change in
student achievement for an individual student between two or more points in
time rather than on proficiency data. The emphasis placed on student outcomes
has served to link teacher effectiveness and quality with teacher
evaluations.

In a way, teacher effectiveness has been equated with student achievement
(Stronge, 2010). In research that supports this statement, Sanders and Rivers
(1996) found an enormous gap in the achievement levels of students who had
three consecutive years of teachers rated as “high” compared to students who
had three consecutive years of teachers rated as “low.” According to Stronge
and Tucker (2000), “there are distinctive qualities that epitomize good
teachers—and one of those qualities is the ability to make a difference in
students’ lives” (p. 1). When it comes to definitions, Jerald (2003) stated
that “we must define good teaching by results, not by personal
characteristics or our preconceived notions. When the goal is student
learning, seeming to be a good teacher and actually being a good teacher can
be very different” (p. 13).

The overarching goal of this research study is to develop a statistical
model that will identify those teachers who are “actually being a good
teacher.” To find those “good teachers,” the models examined in this study
are based on “value-added” models. The value-added approach is based on the
growth a student makes from the time of entering a classroom to the time of
exiting that classroom. This approach differs from most accountability
models, which examine only end-of-year achievement scores and do not take
into account students’ backgrounds, students’ prior achievement, or students’
effect on each other. Of equal importance, this study looks into value-added
fluctuations from year to year as succeeding cohorts of students move through
teachers’ classrooms. The consistency of effectiveness measures continues to
be debated.

Doran (2003) points out that most accountability systems, which are not
value-added, have four basic flaws: (a) they are invalid or misleading since
student achievement scores are affected by external variables that are
outside the influence of schools and teachers, (b) they fail to recognize the
growth of the student since entering the accountable environment, (c) they do
not take into account the cumulative effect of prior learning, and (d) they
use cut-score categories that mismeasure academic performance. The strengths
of a value-added model have led Thum and Bryk (1997) to state that “from a
purely technical perspective, the arguments seem very clear: Anything other
than a value-added-based approach is simply not defensible” (p. 102).

Other authors disagree (Kupermintz, 2002; McCaffrey, Lockwood, Koretz, &
Hamilton, 2003). According to Kupermintz (2002), labeling a teacher as
effective based on the gains of his or her students is logically faulty for
several reasons: (a) teachers with fewer students have less accurate data,
and their estimates are more likely to be “pulled” toward the district
average; (b) there are potentially competing explanations, other than teacher
effectiveness, for the performance of students on tests; (c) value-added
models do not adequately separate the relationship between teacher
characteristics and student characteristics and the observed gains; and (d)
teachers assigned higher performing students are more likely to be labeled as
effective, while teachers assigned high-risk students are more likely to be
labeled as ineffective. Kupermintz (2002) stated that possible influences on
student learning include “personal propensities and resources (both cognitive
and noncognitive), physical and mental maturation, home environment, cultural
heritage, institutional and informal community resources” (p. 294). McCaffrey
et al. (2003) concluded that “the existing research base on VAM [value-added
models] suggests that more work is needed before the techniques can be used
to support important decisions about teachers or schools” (p. 111).

Hierarchical Linear Models and Value-Added Education 

Pedhazur (1997) stated that multilevel analysis uses information from all
available levels (e.g., students, classrooms, schools), making it possible to
learn how variables at one level affect relations among variables at another
level. Moreover, multilevel analysis affords estimation of variance between
groups as distinct from variance within groups. Pedhazur further stated that
multilevel models “yield more realistic standard errors” (p. 692) than
Ordinary Least Squares (OLS) estimates.

A number of examples of educational uses of multilevel methods are
presented below. For the most part, the aim of the researchers who performed
these studies has been to examine the relation of student data and student
achievement to school effects (Marsh, Hau, & Kong, 2002; Pituch, 1999), the
effects of teacher and school characteristics on student achievement
(Bankston & Caldas, 2000; Berends, 2000; Guthrie, 2001; Heck & Crislip, 2001;
Swanson & Stevenson, 2002), school improvement over time (Mandeville, 1988;
Mandeville & Anderson, 1987), student achievement in a specific discipline
(Carbonaro & Gamoran, 2002; Lee & Bryk, 1989; Raudenbush, Fotiu, & Cheong,
1998; Wilkins & Ma, 2002), and specific strategies for student achievement
(Burns & Mason, 2002; Desimone, Porter, Garet, Yoon, & Birman, 2002;
Goldstein, Yang, Omar, Turner, & Thompson, 2000).

Until recently, school effectiveness research has involved an examination
of the variables at either the school level (i.e., aggregated from all the
students in that school but failing to account for individual effects) or the
individual level (i.e., analyzing data at the individual student level but
failing to account for group effects). Since the early 1990s, multilevel
statistical models, such as Hierarchical Linear Modeling (HLM) and
mixed-model statistical analysis, have been gaining popularity in educational
research as a means to examine the effects of different levels of grouping
(e.g., classrooms, schools, and districts) on student achievement.

Over the past two decades, HLM has emerged as a well-accepted statistical
model for studying school effects within an educational setting. One reason
for this growing acceptance is that HLM allows for the effects of the context
to be taken into account. Moreover, HLM offers several advantages over other
methodologies (Lee, 2000) and solves several difficulties with the unit of
analysis. Prior to multi-level analysis, researchers attempted to find
statistical relationships between school factors and variables measured at
the student level. The researchers then had to determine which level of
analysis was appropriate. Lee stated that the researchers had to choose
between the level where the intervention or effect was administered (school
level) and the level where the intervention or effect was believed to occur
(student level).

Lee (2000) pointed out that there are three problems when using a
single-level method, such as OLS multiple regression or analysis of variance
(ANOVA): (a) aggregation bias, (b) misestimated standard errors, and (c)
heterogeneity of regression. The first difficulty is the aggregation bias
that can occur when a variable takes on different effects at different levels
of aggregation. A second difficulty concerns the estimation of the standard
errors used for statistical testing; for example, with multilevel data,
misestimated standard errors can occur when researchers treat individual
cases as though they are independent (a standard assumption of OLS
regression) when they are not. A third difficulty concerns heterogeneity of
regression slopes, which means that relations between characteristics of
students and academic achievement may vary across schools and may be a
function of group level variables.

To a substantial extent, HLM solves the problems of aggregation bias,
misestimated standard errors, and heterogeneity of regression. First, the
problem of aggregation bias is solved since HLM allows for the examination of
the data at more than one level of aggregation. Second, the problem of
misestimated standard errors is avoided since the independence of cases is
not an assumption of HLM. Finally, the problem of heterogeneity of regression
is solved by HLM since the multi-level procedure allows for the investigation
of grouping effects.

There is a history of prior application of HLM in school systems. Over the
last two decades, the State of Tennessee has adopted and extensively used a
value-added model similar to HLM. At the local district level, the Dallas,
Texas, Public Schools was an early adopter of hierarchical modeling
procedures to analyze student data aggregated at the district, school, and
teacher levels. The shift in attention from assessing current level of
performance to showing progress in learning has refined the way in which
policy makers conceptualize educational outcomes.

The Tennessee Value-Added Assessment System (TVAAS) is a statistical model
developed to obtain unbiased estimates of the effect of teachers on the
academic gains of students (Sanders, 2000). Because the model uses each
student’s previous academic history, students serve as their own control for
extraneous factors. Sanders and Horn (1995) stated that “TVAAS was developed
on the premise that society has a right to expect that schools will provide
students with the opportunity for academic growth regardless of the level at
which the students enter the educational venue. In other words, all students
can and should learn commensurate with their abilities” (p. 12). Holland
(2001) stated that value-added models, such as TVAAS, can be used by
decision-makers to evaluate teachers, the latest innovative curriculum, and
teacher preparation programs.

The Dallas, Texas, Public Schools have used a value-added accountability
system for more than two decades (Webster & Mendro, 1997). The current system
in Dallas combines the use of multiple regression and hierarchical linear
modeling. The Dallas accountability model controls for many preexisting
student differences in ethnicity, gender, language proficiency, and
socioeconomic status (termed fairness variables).

Purpose of the Study 

The purpose of this study is to measure teacher effectiveness using a
value-added methodology, namely hierarchical linear modeling. Under this
conceptualization, teacher effectiveness is operationally defined as the
teacher’s impact on students’ reading performance on the statewide
assessment. Because school data are an excellent fit with HLM, this study
examined an urban school district’s data using multilevel models to determine
classroom effectiveness. The study used a multilevel model to control for
selected student demographic effects, previous academic achievement, teacher
characteristics, and classroom demographics to obtain a measure of teacher
effectiveness. It is a two-level model with individual student
characteristics serving as Level 1 variables and classroom characteristics
serving as Level 2 variables.

More specifically, the following research questions were addressed: (a) Is
there enough classroom variance (as measured by the intra-class correlation)
to justify the use of HLM? (b) What student level variables are significant
predictors of student achievement as measured by their reading scores? (c)
What classroom level variables are significant predictors of student
achievement as measured by their reading scores? (d) When combining both
student and classroom level variables, what are significant predictors of
student achievement? (e) How consistent are the ratings of classroom level
effectiveness over a three-year period of time?

This research examined the usefulness of multi-level models when applied to
teacher evaluation or classroom effectiveness. More specifically, the first
major objective was to develop an HLM model to identify effective and
ineffective classrooms using reading scores. The second major objective was
to test the consistency of the teacher scores over a three-year period; in
this regard, the study examined yearly value-added fluctuations as succeeding
cohorts of students moved through teachers’ classrooms.

In this age of accountability, schools are searching for effective 
methods to identify classrooms that consistently “add value” to a student’s 
education. It is hoped the study will add to the growing body of literature on 
the use of multilevel models and evaluating classroom level effectiveness. 

Methods 

The context for the study was elementary schools in the Jefferson County
Public Schools (JCPS) in Louisville, Kentucky. The district is located in a
large metropolitan area and has about 152 schools serving approximately
100,000 students. JCPS educates a high percentage of at-risk urban students.
The district has a student assignment plan based on managed choice, which
facilitates the racial desegregation of its schools by providing students
with transportation. An additional contextual factor to consider is that
School-Based Decision Making (SBDM) is the model employed for setting school
policy, consistent with district board policy, to enhance students’
achievement.

Participants 

Reading scores of fourth grade students were analyzed from all elementary
schools in JCPS. The data spanned three consecutive years (2001-2003). Two
criteria were used when evaluating the data set prior to the analysis. First,
the student must have been enrolled in the school system for a minimum of 100
school days (as regulated by the state’s accountability system). Second, only
classrooms with at least 15 students were included in the analyses (Kreft &
De Leeuw, 1998). In total, 2,955 student records were removed from the
three-year analyses. As a result of the selection criteria, the numbers of
students included were 5,837 (Year 1), 5,645 (Year 2), and 5,724 (Year 3),
and the numbers of teachers were 241 (Year 1), 235 (Year 2), and 236 (Year
3), drawn from 81 elementary schools. It should be noted that the students
who tended to be eliminated because of small class sizes (i.e., fewer than 15
students per classroom) were (a) special needs students, (b) English as a
second language (ESL) students, and (c) gifted and talented students.
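
The two selection criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`classroom_id`, `days_enrolled`), the function name, and the toy data are hypothetical, and the sketch assumes class size is counted after the enrollment filter, which the text does not specify.

```python
from collections import Counter

MIN_DAYS_ENROLLED = 100  # enrollment threshold stated in the text
MIN_CLASS_SIZE = 15      # classroom-size threshold stated in the text


def apply_selection_criteria(records):
    """Keep the student records that pass both selection criteria.

    Assumption: class size is counted after the enrollment filter;
    the text does not say which order was used.
    """
    enrolled = [r for r in records if r["days_enrolled"] >= MIN_DAYS_ENROLLED]
    class_sizes = Counter(r["classroom_id"] for r in enrolled)
    return [r for r in enrolled if class_sizes[r["classroom_id"]] >= MIN_CLASS_SIZE]


# Hypothetical toy data: classroom "A" has 16 eligible students; classroom "B"
# has 15 students, one of whom was enrolled for only 90 days.
records = [{"classroom_id": "A", "days_enrolled": 170} for _ in range(16)]
records += [{"classroom_id": "B", "days_enrolled": 170} for _ in range(14)]
records.append({"classroom_id": "B", "days_enrolled": 90})

kept = apply_selection_criteria(records)
print(len(kept), {r["classroom_id"] for r in kept})
```

In the toy data, classroom "B" falls below 15 eligible students once the short-enrollment record is dropped, so the whole classroom is excluded. This mirrors how small, specialized classrooms could drop out of the analysis.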

Instrumentation

The student level variables examined included the following: gender, race,
socio-economic status (measured as free/reduced lunch and median income by
census tract), parents’ education (median education level by census tract),
attendance, age, Comprehensive Test of Basic Skills (CTBS) reading score
(administered in the spring of the prior school year), and classification
(special needs, English as a second language, or gifted and talented). Table
1 describes the Level 1, student-related variables.

Table 1

Level 1 Individual Student Variables

Variables                     Description

KCCT reading scale score      KCCT reading scale score (dependent variable)
Student total absences        Number of days absent during the school year
Student days membership       Number of days enrolled during the school year
Parents’ education index      0 = completion of grades 0-8,
                              1 = grades 9-12 no diploma,
                              2 = diploma or equivalent,
                              3 = college no degree,
                              4 = associate degree,
                              5 = bachelor degree,
                              6 = master’s degree,
                              7 = professional or doctoral degree
Parents’ median education     Census tract median education
Family median income          Census tract median income
CTBS reading scale score      CTBS reading scale score
Student percent attendance    (Days membership - Days absent) / Days membership x 100
Student days old              Number of days old on May 1 of testing year
Student African American      1 = African American, 0 = non-African American
Student White                 1 = White, 0 = non-White
Student other                 1 = other, 0 = White or African American
Student female                1 = female, 0 = male
Student free/reduced lunch    1 = free/reduced lunch, 0 = full price lunch
Student ESL                   1 = ESL student, 0 = non-ESL student
Student special needs         1 = special needs student, 0 = non-special needs
Student gifted                1 = gifted student, 0 = non-gifted student


The teacher/classroom level variables that were examined included the
following: classroom aggregated data of selected student level variables,
number of students in classroom, teacher’s years of experience, teacher’s
level of education, and teacher’s rank. Table 2 describes the Level 2,
classroom-related variables.

Table 2

Level 2 Teacher Variables

Variables                       Description

Teacher experience in schools   Number of years of experience teaching within
                                the school district
Class size                      Number of students in the classroom
Teacher degree                  4 = bachelor,
                                5 = 5th year program,
                                6 = 6th year program,
                                7 = master’s,
                                8 = master’s degree plus 30 graduate hours,
                                9 = doctorate
Teacher female                  1 = female, 0 = male
Teacher White                   1 = White, 0 = non-White
Teacher African American        1 = African American, 0 = non-African American
Teacher other                   1 = other, 0 = White or African American
Teacher rank                    5 = doctorate,
                                10 = rank I (master’s degree plus 30 graduate hours),
                                15 = master’s +15 graduate credit hours,
                                20 = rank II (master’s degree),
                                25 = bachelor’s +15 graduate credit hours,
                                30 = rank III (bachelor’s degree),
                                40 = emergency certified


Table 3 reports the Level 1, student-related variables with the number of
participants, means, and standard deviations for each year of data. It
includes three types of data: (a) demographic (e.g., gender, race), (b)
academic (e.g., reading test scores), and (c) non-academic (e.g.,
attendance).

Design and Procedures

Raudenbush and Bryk’s (2002) methodological approach was used as a guide to
the model development. The analysis was completed in five stages: (a) the
one-way ANOVA with random effects (i.e., unconditional model), (b) the
one-way ANCOVA model with random effects (i.e., conditional model at the
student level), (c) the regression model with means-as-outcomes (i.e.,
conditional model at the classroom level), (d) an intercepts- and
slopes-as-outcomes model (i.e., full model), and (e) the analysis of
residuals.

To answer the last research question (how consistent are the ratings of
classroom level effectiveness over a three-year period?), the strategy was to
use the residuals from the multilevel models. First, correlations were
performed comparing data from all three years. The variables for the
correlations consisted of the classroom-level residuals generated by each
model. For each model, the year-to-year Pearson correlation coefficients were
calculated, meaning r for Year 1 with Year 2, r for Year 1 with Year 3, and r
for Year 2 with Year 3. Second, a comparison was made after sorting each
classroom into one of five categories: well-above average (two or more
standard deviations above the mean), above average (between one and two
standard deviations above the mean), average (between one standard deviation
above the mean and one standard deviation below the mean), below average
(between one and two standard deviations below the mean), and well-below
average (two or more standard deviations below the mean). The comparisons
were made on the number of individual teachers who remained in the same
category compared to the number of teachers who shifted across categories.
This is a critical element of this study since it helped assess the
consistency of teacher effectiveness ratings across three years for the same
teachers.
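
The two consistency checks described above can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the function names are hypothetical, the handling of values that fall exactly on a one- or two-standard-deviation boundary is an assumption, and the sample standard deviation (n - 1 denominator) is assumed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sd_band(z):
    """Map a standardized residual to one of the five categories.

    Assumption: boundary values go to the more extreme category.
    """
    if z >= 2:
        return "well-above average"
    if z >= 1:
        return "above average"
    if z > -1:
        return "average"
    if z > -2:
        return "below average"
    return "well-below average"


def categorize(residuals):
    """Standardize one year's classroom residuals and assign categories."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    return [sd_band((r - mean) / sd) for r in residuals]


def count_stable(categories_a, categories_b):
    """Count teachers who stay in the same category in both years."""
    return sum(a == b for a, b in zip(categories_a, categories_b))
```

Given one list of residuals per year, ordered by the same teachers, `pearson_r(year1, year2)` gives the year-to-year correlation and `count_stable(categorize(year1), categorize(year2))` gives the number of teachers whose category did not change.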

Results 

Results are presented based on the guiding research questions noted earlier
in the paper, using the sequential HLM modeling approach recommended by
Raudenbush and Bryk (2002), starting with the simpler models and ending with
the more complex models: (a) the one-way ANOVA with random effects, (b) the
one-way Analysis of Covariance (ANCOVA) with random effects, (c) the
regression with means-as-outcomes, and (d) the model with intercepts- and
slopes-as-outcomes.

Research question 1: Is there enough classroom variance (as measured 
by the intra-class correlation) to justify the use of HLM? 

Analytic strategy: One-way ANOVA with random effects (fully unconditional
model)

The general model is represented by the following equations:

Level 1 equation: $Y_{ij} = \beta_{0j} + r_{ij}$

Level 2 equation: $\beta_{0j} = \gamma_{00} + u_{0j}$

Expanded model: $Y_{ij} = \gamma_{00} + u_{0j} + r_{ij}$

For the purpose of this research project, $Y_{ij}$ is the KCCT reading score
for the $i$th student in the $j$th classroom. $\beta_{0j}$ is the mean
outcome for the $j$th classroom. The variable $r_{ij}$ is the error term.
This model assumes that these error terms are normally distributed with a
mean of zero and a constant Level 1 variance, $\sigma^2$ (Raudenbush & Bryk,
2002). In this research the Level 2 variable was the classroom. The Level 2
variable $\gamma_{00}$ is the grand-mean outcome of all the students, with
$u_{0j}$ the random effect associated with the $j$th classroom, which is
assumed to have a mean of zero and a variance of $\tau_{00}$. In other words,
the variable $u_{0j}$ is the deviation of classroom $j$’s mean from the grand
mean.

This model is the fully unconditional model in that no predictors are
specified at either Level 1 or Level 2. When examining teacher effectiveness,
the researchers first had to determine whether a two-level model would be
more appropriate to use than an OLS regression model. As stated previously,
HLM analysis is appropriate to use if there is sufficient variance at the
classroom level. The required statistic is the intra-class correlation (ICC),
which provides information about the outcome variability at each of the two
levels; in that sense, the ICC measures the proportion of the variance in the
outcome that is between the Level 2 units (Raudenbush & Bryk, 2002).
Sufficient variance has been interpreted as 10% or more (Ma, 2001). In this
study, when the one-way ANOVA with random effects was completed, the amount
of variance at the classroom level was 21% for Years 1 and 2, and 17% for
Year 3 (see Table 6 for details). Therefore, there was more than sufficient
variance at the classroom level to justify the use of a multi-level analysis.
The grand means for the three years ranged from 541 to 545. The null
hypothesis $H_0: \tau_{00} = 0$ provides a test of whether there is
significant variation among classroom means on KCCT reading scores. Table 6
shows that the obtained $\chi^2$ statistics for all three data sets were
significant at p < .001.
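
As a worked check, the intra-class correlation can be recomputed from the variance components in Table 6 as the classroom level variance divided by the sum of the classroom and student level variances. The sketch below does this for all three years. It also computes a range of plus or minus 1.96 classroom-level standard deviations around the grand mean, which is assumed here to be how the "95% confidence interval for classroom mean" row of Table 6 was obtained. The function names are hypothetical.

```python
import math


def intraclass_correlation(tau00, sigma2):
    """Share of total outcome variance that lies between classrooms."""
    return tau00 / (tau00 + sigma2)


def classroom_mean_range(gamma00, tau00, z=1.96):
    """Grand mean plus/minus z classroom-level standard deviations."""
    half_width = z * math.sqrt(tau00)
    return (gamma00 - half_width, gamma00 + half_width)


# (grand mean, student level variance, classroom level variance) from Table 6
table6 = {
    "Year 1": (541.10, 1317.85, 353.09),
    "Year 2": (544.34, 1039.66, 279.88),
    "Year 3": (544.52, 1083.35, 227.28),
}

for year, (gamma00, sigma2, tau00) in table6.items():
    icc = intraclass_correlation(tau00, sigma2)
    low, high = classroom_mean_range(gamma00, tau00)
    print(f"{year}: ICC = {icc:.2f}, classroom-mean range = ({low:.2f}, {high:.2f})")
```

Rounded to two decimals, the computed ICCs agree with the values of .21, .21, and .17 reported in Table 6, each above the 10% threshold cited from Ma (2001).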

At this initial stage of the analysis, deviance is another important
statistic to be examined. According to Kreft and De Leeuw (1998):

The difference in deviation is especially useful to estimate the
improvement of fit when the between variance is no longer uniquely
defined. For that reason deviances are considered the most important
feature in the output, and used for an overall evaluation of models.
As a rule of thumb, in order to reach the conclusion that one model is
a significant improvement over another, the difference in deviances
between two models should be at least twice as large as the difference
in the number of estimated parameters. (p. 65)

The deviances for the one-way ANOVA with random effects in the current study
(all with two parameters) are reported near the bottom of Table 6.
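
The rule of thumb quoted from Kreft and De Leeuw (1998) amounts to a simple comparison, sketched below. The function name is hypothetical. In the usage example, only the Year 1 unconditional deviance (58,754.27 with two parameters) comes from Table 6; the deviance and parameter count for the more complex model are made-up values for illustration.

```python
def is_significant_improvement(deviance_simple, params_simple,
                               deviance_complex, params_complex):
    """Apply the rule of thumb: the drop in deviance should be at least
    twice the number of additional estimated parameters."""
    extra_params = params_complex - params_simple
    if extra_params <= 0:
        raise ValueError("the complex model must estimate more parameters")
    deviance_drop = deviance_simple - deviance_complex
    return deviance_drop >= 2 * extra_params


# Year 1 unconditional model from Table 6 versus a HYPOTHETICAL richer model
# with 8 parameters and a deviance of 58,000.00 (illustrative values only).
print(is_significant_improvement(58754.27, 2, 58000.00, 8))
```

Here the hypothetical model lowers the deviance by 754.27 while adding six parameters, so the drop far exceeds the threshold of 12.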

Table 6

Summary of One-Way ANOVA with Random Effects for Three Years of Data

Model characteristics                    Year 1             Year 2             Year 3

Grand-mean of KCCT reading scale score   541.10             544.34             544.52
95% confidence interval for grand-mean   (538.55, 543.65)   (542.03, 546.65)   (542.40, 546.64)
Student level variance                   1317.85            1039.66            1083.35
Classroom level variance                 353.09             279.88             227.28
95% confidence interval for
  classroom mean                         (504.27, 577.93)   (511.55, 577.13)   (514.97, 574.07)
$\chi^2$, df (null hypothesis test)      1,763.41, 230*     1,749.53, 234*     1,433.18, 235*
Intraclass correlation                   .21                .21                .17
Deviance                                 58,754.27          54,863.34          56,218.71
Estimated parameters                     2                  2                  2

*p < .001



Research question 2: What student level variables are significant
predictors of student achievement as measured by their reading scores?

Analytic strategy: One-way ANCOVA with random effects (conditional at
Level 1)

In this model, the Level 1 equation includes an explanatory variable that is
centered around the grand mean, while the Level 2 intercept equation is
random and the slope equations are fixed. The equations are as follows:

Level 1 equation: $Y_{ij} = \beta_{0j} + \beta_{1j}(X_{ij} - \bar{X}_{..}) + r_{ij}$

Level 2 equations: $\beta_{0j} = \gamma_{00} + u_{0j}$

$\beta_{1j} = \gamma_{10}$

Expanded model: $Y_{ij} = \gamma_{00} + u_{0j} + \gamma_{10}(X_{ij} - \bar{X}_{..}) + r_{ij}$

This model fixes the coefficient, $\beta_{1j}$, to be the same for all of the
classrooms. In this model $\gamma_{10}$ is the pooled within-group regression
coefficient of $Y_{ij}$ on $X_{ij}$ (Raudenbush & Bryk, 2002). The variable
$\beta_{0j}$ is the mean outcome for each classroom adjusted for the
difference among these groups in $X_{ij}$. It should be noted that “the
$\mathrm{Var}(r_{ij}) = \sigma^2$ is now a residual variance after adjusting
for the Level 1 covariate, $X_{ij}$” (Raudenbush & Bryk, 2002, p. 25).
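
To make the role of grand-mean centering concrete, the sketch below centers a covariate and evaluates the fixed part of the expanded model for a single student. It is an illustration under stated assumptions: the function names and the three CTBS scores are hypothetical, and the intercept (542.72) and CTBS coefficient (.45) are the Year 1 values from Table 7, which come from a model that also included other covariates not represented here.

```python
def grand_mean_center(values):
    """Subtract the grand mean from every value of a Level 1 covariate."""
    grand_mean = sum(values) / len(values)
    return [v - grand_mean for v in values]


def predicted_score(gamma00, gamma10, centered_x, u0j=0.0):
    """Expanded model without the student-level error term:
    gamma00 + u0j + gamma10 * (X_ij - grand mean of X)."""
    return gamma00 + u0j + gamma10 * centered_x


# Hypothetical CTBS reading scale scores for three students.
ctbs_scores = [600.0, 640.0, 680.0]
centered = grand_mean_center(ctbs_scores)

# Year 1 intercept and CTBS coefficient as reported in Table 7; u0j = 0
# corresponds to a classroom whose adjusted mean equals the grand mean.
for x in centered:
    print(round(predicted_score(542.72, 0.45, x), 2))
```

Because the covariate is centered, a student at the grand mean of the covariate has a predicted score equal to the intercept plus the classroom effect. This is why the classroom intercept can be read as an adjusted classroom mean.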




Planning and Changing 







This research question was associated with student level variables that are
significant predictors of student achievement as measured by their reading
scores. An ancillary question was associated with the percent of the variance
in reading scores accounted for by student level variables. The student level
variables that were significant for all three years were CTBS reading scale
score, free or reduced lunch, female, advance program, African American, and
attendance. The students’ CTBS reading scores had the strongest relationship
of all the variables. The t-values for CTBS reading scores were all > 33,
while the absolute values of all the other t-values were < 10.

The directionality of the coefficients associated with student variables
was as follows: the coefficient for days absent was negative (i.e., the more
days absent, the more likely the KCCT reading score was lower), the
coefficient for CTBS reading scale score was positive (i.e., the higher the
score on the CTBS, the more likely the KCCT reading scale score was higher),
the coefficient for African American was negative (i.e., African American
students were more likely to score lower on the KCCT reading test), the
coefficient for female was positive (i.e., female students were more likely
to have a higher KCCT reading scale score), the coefficient for free or
reduced lunch was negative (i.e., students who qualified for free or reduced
lunch were more likely to score lower on the KCCT reading test), and the
coefficient for advance program was positive (i.e., advance program students
were more likely to score higher on the KCCT reading test). In summary, the
significant student level variables can be classified into two academic
measures (i.e., CTBS reading scale score and advance program) and four
non-academic measures (i.e., gender, race, socio-economic status, and
attendance). See Table 7 for details.

Table 7

Summary of ANCOVA with Random Effects Coefficients

                         Year 1                 Year 2                 Year 3
Variables a              Coefficient  t-value   Coefficient  t-value   Coefficient  t-value

Intercept                542.72       765.74    544.44       802.09    544.46       874.38
Days absent              -.39         -6.50     -.29         -3.37     -.29         -4.99
CTBS scale scores        .45          33.71     .45          33.61     .44          34.29
African American         -7.95        -8.95     -6.82        -8.49     -5.61        -6.57
Female                   6.45         9.59      6.65         9.72      6.48         8.58
Free or reduced lunch    -5.98        -7.00     -4.61        -5.81     -6.90        -6.90
Advance program          10.95        4.78      12.41        7.47      13.35        8.23

a All t-values are significant at the p < .001 level with the exception of Year 2 Days Absent, 
which is significant at/) < .01 level. 

The ancillary research question asked about the reduction in the
Level 2 variance (classrooms) from the significant Level 1 (student)
variables. In the ANCOVA with random effects model, the share of variance
at Level 2 is reduced to 14% (Year 1), 12% (Year 2), and 9% (Year 3). This
model explained 36%, 43%, and 50% of the Level 2 variance in Year 1,
Year 2, and Year 3, respectively. Thus, the amount of unexplained Level 2
variance is 64%, 57%, and 50% for Year 1, Year 2, and Year 3, respectively.
As stated previously, this indicates that a large portion of the Level 2
variance can be attributed to the differences among the students in these
classrooms. The Level 2 error term is still significant, meaning that
there is still some Level 2 variance to be explained.
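
The arithmetic behind these percentages can be sketched as follows. This is a
minimal illustration, assuming the usual HLM definitions: the intraclass
correlation as the share of total variance lying between classrooms, and the
variance explained as the proportional drop in the Level 2 variance component
relative to the unconditional model. The function names and the variance
components are hypothetical and are not estimates from this study.

```python
def icc(tau00, sigma2):
    """Intraclass correlation: share of total variance lying between classrooms."""
    return tau00 / (tau00 + sigma2)


def proportion_explained(tau00_null, tau00_conditional):
    """Proportional reduction in Level 2 variance relative to the null model."""
    return (tau00_null - tau00_conditional) / tau00_null


# Hypothetical variance components (NOT estimates from this study).
tau00_null, sigma2_null = 100.0, 400.0   # unconditional model
tau00_cond, sigma2_cond = 60.0, 340.0    # after adding Level 1 predictors

print(round(icc(tau00_null, sigma2_null), 2))                  # 0.2
print(round(icc(tau00_cond, sigma2_cond), 2))                  # 0.15
print(round(proportion_explained(tau00_null, tau00_cond), 2))  # 0.4
```

With these made-up numbers, 40% of the Level 2 variance is explained and 60%
remains, which mirrors how the explained and unexplained percentages above
sum to 100% within each year.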

Research question 3: What classroom level variables are significant pre¬ 
dictors of student achievement as measured by their reading scores? 

Analytic strategy: Regression with means-as-outcomes (conditional at 
Level 2) 

The researchers used the same Level 1 model from Stage 1, but 
introduced Level 2 variables. Thus the Level 2 equation with one predic¬ 
tor variable can be written as: 

Level 2 equation: $\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}$

Expanded model: $Y_{ij} = \gamma_{00} + \gamma_{01} W_j + u_{0j} + r_{ij}$

$W_j$ is a Level 2 predictor. In this model $u_{0j}$ now represents the
residual, and $\tau_{00}$ is now the residual or conditional variance in the
intercept $\beta_{0j}$ after controlling for the effects of $W_j$
(Raudenbush & Bryk, 2002). For this research, all Level 2 predictors were
tested to determine which were statistically significant. Then various
combinations of predictor variables were explored. An intra-class
correlation coefficient was calculated to determine the conditional
variance in KCCT reading scores controlling for all $W_j$.

In this model, the outcome is predicted by classroom-level char¬ 
acteristics used as predictors (i.e., conditional at Level 2) and answers the 
second major research question. The second major research question was 
concerned with significant variables at Level 2 and—as an ancillary ques¬ 
tion—the amount of Level 2 variance accounted for by those variables. Al¬ 
though the three years originally produced slightly differing models, they 
all accounted for approximately 32-39% of the Level 2 variance. There 
were three Level 2 variables that were significant for all three years: teach¬ 
er years experience, mean class education index, and mean class CTBS 
reading scale score. These three Level 2 variables explained approximate¬ 
ly 29-32% of the Level 2 variance in school means on KCCT reading. 
The conditional ICCs for the three-variable models were .155 (Year 1), .161
(Year 2), and .126 (Year 3).

Research question 4: When combining both student and classroom level
variables, what are significant predictors of student achievement? 

Analytic strategy: Intercepts- and slopes-as-outcomes (full model) 

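This model is the same as the random-coefficients regression
model, except that Level 2 predictors $W_j$ are now included. The equations
that represent this model are:

Level 1 equation: $Y_{ij} = \beta_{0j} + \beta_{1j}(X_{ij} - \bar{X}_{\cdot j}) + r_{ij}$

Level 2 equations: $\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}$

$\beta_{1j} = \gamma_{10} + \gamma_{11} W_j + u_{1j}$

Expanded model:

$Y_{ij} = \gamma_{00} + \gamma_{01} W_j + u_{0j} + \gamma_{10}(X_{ij} - \bar{X}_{\cdot j}) + \gamma_{11} W_j (X_{ij} - \bar{X}_{\cdot j}) + u_{1j}(X_{ij} - \bar{X}_{\cdot j}) + r_{ij}$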
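
The relationship between the two-level form and the expanded form can be
checked numerically. The sketch below, with hypothetical function names and
made-up parameter values (none taken from the study), substitutes the Level 2
equations into the Level 1 equation and confirms that the result matches the
expanded model term by term.

```python
import math


def predict_two_step(x, x_bar, w, g00, g01, g10, g11, u0, u1, r):
    """Evaluate the Level 2 equations first, then the Level 1 equation."""
    beta0 = g00 + g01 * w + u0          # classroom intercept
    beta1 = g10 + g11 * w + u1          # classroom slope
    return beta0 + beta1 * (x - x_bar) + r


def predict_expanded(x, x_bar, w, g00, g01, g10, g11, u0, u1, r):
    """Evaluate the single combined (expanded) equation."""
    xc = x - x_bar                      # group-mean-centered predictor
    return g00 + g01 * w + u0 + g10 * xc + g11 * w * xc + u1 * xc + r


# Hypothetical values for one student in one classroom (illustration only).
params = dict(x=620.0, x_bar=600.0, w=10.0,
              g00=544.0, g01=0.5, g10=0.45, g11=0.01,
              u0=3.0, u1=-0.02, r=-5.0)

y_a = predict_two_step(**params)
y_b = predict_expanded(**params)
print(math.isclose(y_a, y_b))  # True
```

The cross-level interaction term, the product of $W_j$ and the centered
predictor, appears only because the slope itself is modeled at Level 2.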

The third major research question was concerned with developing a 
full prediction model. This is a natural step after identifying the significant 
Level 1 and Level 2 variables. Answering this research question involved 
constructing several models. When originally combining the models, there 
were six Level 1 variables (CTBS reading scale score, African American, 
female, advance program, total absences, and free and reduced lunch), three 
Level 2 variables predicting the level one intercept (teacher years experi¬ 
ence, classroom education index, and classroom CTBS reading scale score), 
and one random coefficient (i.e., for CTBS reading scale score). All of these 
variables, in addition to the Level 1 error term, the intercept, and the Level 
2 random component, were included in the full prediction model. The
rationale for only one random slope is that none of the other Level 1
slopes had a significant residual term, which suggests that those slopes
do not vary significantly across classrooms.

When combining the significant Level 1 and Level 2 variables 
into the same model, two of the Level 2 variables, classroom education in¬ 
dex and classroom CTBS reading scale score, were no longer significant 
for all three years. Therefore, a revised model dropped the latter Level 2 
variables. After dropping the two variables it was noted that for Year 2 and 
Year 3 there was little difference when examining the deviance, but there 
was a small difference in the Year 1 models. The revised model explained 
between 70% and 77% of the Level 2 variance of the intercept, which rep¬ 
resented the mean KCCT reading score. See Table 8 for details. 
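
The paper reports deviances without stating the test applied to them. As a
hedged reminder of the standard comparison for nested models, assuming both
are fitted by full maximum likelihood to the same data, the difference in
deviance is referred to a chi-square distribution whose degrees of freedom
equal the difference in the number of estimated parameters. The symbols
$D$ and $p$ below are introduced here for illustration only.

```latex
\Delta D = D_{\text{reduced}} - D_{\text{full}}, \qquad
\Delta D \;\sim\; \chi^2_{\,p_{\text{full}} - p_{\text{reduced}}}
\quad \text{(approximately, if the reduced model holds)}
```

Under this reading, dropping two Level 2 fixed effects corresponds to 2
degrees of freedom, and a small $\Delta D$ indicates that the simpler model
fits about as well.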

Table 8

Comparison of Intercepts- and Slopes-as-Outcomes Models for Coefficients of β

                                                Year 1†      Year 2†      Year 3†
Variable                                        Deviance     Deviance     Deviance

Parents’ education index                        50486**      48473*       49984*
Student total attendance                        50492**      n.s.         n.s.
Student advance program                         50483**      n.s.         49987**
Teacher years experience                        n.s.         n.s.         49995**
Median education                                50488**      n.s.         49985*
Teacher non-Caucasian                           n.s.         n.s.         49988**
Teacher race other                              n.s.         n.s.         49989**
Student free and reduced                        n.s.         n.s.         49986**
Student percent attendance                      50482**      n.s.         n.s.
Intercepts- and slopes-as-outcomes Model 2      50483        48477        49986

Note. n.s. = not significant.
† Parameter = 4, * p < .01, ** p < .05


Research question 5: How consistent are the ratings of classroom level 
effectiveness over a three-year period of time? 

Analytic strategy: Analysis of residuals 

The last major research question deals with the consistency of 
the teacher effectiveness ratings using the residuals from year to year. To 
examine the models and their consistency, three steps were taken. First, 
the researchers examined three models. The first model (A) used only the 
CTBS reading scale score as a Level 1 predictor with the coefficients fixed 
and a random intercept. The second model (B) used the six significant 
Level 1 variables: absences, CTBS reading scale score, African-Ameri¬ 
can, female, student free or reduced lunch, and advance program and had 
fixed coefficients and a random intercept. The third model (C) is the same 
as the second model except the intercept has a predictor variable of Teach¬ 
er Years Experience. 

When examining the correlations, all three models are highly cor¬ 
related with each other. All correlations are above .92. Thus, all three mod¬ 
els appear to be viable options for obtaining teacher effectiveness indices. 
Table 9 shows the Pearson correlation coefficient between the models for 
each year. The number of teachers was 241, 235, and 236 for Year 1, Year 
2, and Year 3, respectively. 

Table 9

Correlation of Models

            Model A/Model B     Model B/Model C     Model A/Model C

Year 1           .98                 .95                 .92
Year 2           .97                 .98                 .95
Year 3           .96                 .98                 .93

Note. All correlations are significant at the .001 level.
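
The correlations between model residuals can be reproduced with a plain
Pearson coefficient. The sketch below is a minimal illustration; the function
name and the five classroom residuals are hypothetical and do not come from
the study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical residuals for the same five classrooms under two models.
model_a = [4.1, -2.3, 0.5, 6.8, -5.0]
model_b = [3.6, -1.9, 0.9, 6.1, -4.4]
print(round(pearson_r(model_a, model_b), 2))
```

The same function applies to the cross-year comparison: pair each teacher's
residual in one year with the same teacher's residual in another year.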


The second part of this analysis compared the correlation between 
the models and the null model across the three years to determine which 
model was most consistent. Table 10 shows the Pearson correlation coeffi¬ 
cient of the residuals from the different models across the three years. The 
number of teachers was 150, 149, and 116 for Year 1, Year 2, and Year 3, 
respectively. All four models provided a significant correlation at the p < 
.001 level, but the Null model and Model A provided the most consistent 
residuals over time. 

Table 10

Correlation of Residuals by Year

            Null model     Model A     Model B     Model C

Year 1         .72           .63         .56         .56
Year 2         .72           .54         .44         .42
Year 3         .68           .59         .52         .50

Note. All correlations are significant at the .001 level.


The third part of this analysis included a comparison of the rankings
for teachers that were in all three years of the study (n = 100). Following
the established practice in value added models, an “effective” teacher
would be associated with a positive residual and an “ineffective” teacher
with a negative residual. Based on the residuals from Model C, all teachers
were given a ranking of 2 = well above average (two or more standard
deviations above the mean), 1 = above average (the residual is between one
and two standard deviations above the mean), 0 = average (within one
standard deviation above or below the mean), -1 = below average (the
residual is between one and two standard deviations below the mean), and
-2 = well below average (two or more standard deviations below the mean).
Table 11 shows a summary of the patterns of teacher rankings over the three
years, including the sum of each teacher's residual rankings and its
frequency (for example, the first two teachers have rankings of 2, 1, 1
and 1, 1, 2, each of which sums to 4). Only one teacher crossed through the
average range (i.e., had both a positive and a negative residual ranking).
Four teachers had three different levels of performance (i.e., a different
level each year).
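
The ranking scheme can be sketched in a few lines. This is an illustration
under stated assumptions: residuals are standardized with the population
standard deviation, and a value exactly at a boundary goes to the more
extreme category for ±2 and for +1. The paper does not specify either detail,
and all names below are hypothetical.

```python
import statistics


def standardize(residuals):
    """Convert residuals to z-scores (population SD; an assumption)."""
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    return [(r - mean) / sd for r in residuals]


def rank_residual(z):
    """Map a standardized residual to a ranking from -2 to 2."""
    if z >= 2:
        return 2
    if z >= 1:
        return 1
    if z > -1:
        return 0
    if z > -2:
        return -1
    return -2


def summarize(ranks):
    """Summarize one teacher's rankings across the three years."""
    return {
        "sum": sum(ranks),
        "crossed_average": max(ranks) > 0 and min(ranks) < 0,
        "distinct_levels": len(set(ranks)),
    }


print(summarize((2, 1, 1)))   # sum of 4, as in the example in the text
print(summarize((1, -1, 0)))  # the one pattern that crosses the average range
```

A pattern such as 2, 0, 1 has three distinct levels without crossing the
average range, which is why the two counts in the text differ.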

Table 11

Summary of Individual Teacher Rankings

Sum            Individual teacher combinations of residuals: Year 1, Year 2, Year 3
(frequency)    (frequency)

 4 (n = 2)     2, 1, 1 (n = 1)      1, 1, 2 (n = 1)
 3 (n = 6)     1, 1, 1 (n = 3)      2, 0, 1 (n = 1)      0, 2, 1 (n = 1)      0, 1, 2 (n = 1)
 2 (n = 10)    1, 1, 0 (n = 5)      0, 1, 1 (n = 3)      2, 0, 0 (n = 2)
 1 (n = 17)    1, 0, 0 (n = 5)      0, 1, 0 (n = 5)      0, 0, 1 (n = 7)
 0 (n = 40)    0, 0, 0 (n = 39)     1, -1, 0 (n = 1)
-1 (n = 14)    -1, 0, 0 (n = 2)     0, -1, 0 (n = 2)     0, 0, -1 (n = 10)
-2             -1, -1, 0 (n = 2)    -1, 0, -1 (n = 1)    0, -1, -1 (n = 1)    -2, 0, 0 (n = 1)
-3 (n = 2)     -1, -1, -1 (n = 2)
-4             -2, -1, -1 (n = 1)   -1, -1, -2 (n = 1)

Note. Under Individual Teacher Combinations, within the grouped numbers, first number
refers to Year 1 ranking, second number refers to Year 2 ranking, and third number refers
to Year 3 ranking.

In summary, HLM models were constructed to assess predictors
of student performance and to rank-order classroom performance using
residuals (see Table 12).

Table 12

Key Results of HLM Models for KCCT Reading Years 1-3

                          Predictors
Model                     Level-1 significant         Level-2 significant      Results

One-way ANOVA with        n/a                         n/a                      Intraclass correlation
random effects                                                                 between .17 and .21
(unconditional model)

ANCOVA with random        1) Days absent              n/a                      Intraclass correlation
effects (conditional      2) CTBS reading scale                                between .09 and .14
at level-1 or using          score
student-level             3) African American
predictors)               4) Female
                          5) Free or reduced lunch
                          6) Advance program

Regression with           n/a                         1) Mean teacher          Conditional intraclass
means-as-outcomes                                        experience            correlation between
(conditional at                                       2) Mean education        .13 and .16
level-2 or using                                         index
classroom-level                                       3) Mean CTBS reading
predictors)                                              scale score

Intercepts- and           1) Days absent              Teacher years of         This model explains
slopes-as-outcomes        2) CTBS reading scale       experience               between .70 and .76
models (full model           score                                             of the variance of the
or using student-         3) African American                                  level-2 intercept
and classroom-level       4) Female
predictors)               5) Free or reduced lunch
                          6) Advance program


Discussion 

Despite the black box associated with evaluations of
teaching-and-learning processes (Munoz, 2005), researchers and practitioners are
coming to an agreement that teachers are the most important factor deter¬ 
mining a child’s educational success. Teachers affect student learning in 
multiple ways (Stronge, 2007, 2010), particularly how students learn (i.e., 
pedagogy), what students learn (i.e., curriculum design), and how much 
students learn (i.e., achievement). This is particularly important as the re¬ 
authorization of the Elementary and Secondary Education Act (ESEA) 
looms in our high-stakes accountability environment across our nation’s 
public schools. 


Vol. 42, No. 3/4, 2011, pp. 241-273 



Munoz, Prather, 
Stronge 


Application of Various Models to Teacher Effectiveness Analyses 

From the start of this research the overriding question has been wheth¬ 
er there is a statistical model that will identify effective teachers. The approach 
decided upon for the study was a value-added approach. There are several ad¬ 
vantages of using the value-added approach. These advantages include taking 
into account the student’s prior achievement, student characteristics, class¬ 
room characteristics, and the effect students have on each other (Palardy & 
Rumberger, 2008). The best model for determining the teacher’s value-added 
score was a regression-based, multi-level model known as HLM. 

Multiple models were obtained to determine the teacher effective¬ 
ness ratings. The null model was only used to assess whether the ICC was 
enough to appropriately use HLM. The first model, which is similar to the 
one used for TVAAS, is based on a student’s previous scale score (CTBS 
reading scale score). This is a very efficient model due to the reliance on pre¬ 
vious test scores. The second and third models are similar in many respects 
to the Dallas model because they also include student characteristics. 

For this study, the third model was the one the researchers used 
to obtain the teacher effectiveness ratings. The third model uses teacher 
years experience to predict the level one intercept. An argument could be 
made to use the second model that examines teacher effectiveness regard¬ 
less of their experience. However, the researchers decided to use the mod¬ 
el that included teacher experience as a predictor. The reason is that the 
more experienced teachers could potentially obtain better results and the 
researchers did not want to over-identify newer teachers as less effective, 
because they could be very effective for their stage of career. The
researchers would recommend using the first model (only previous test
scores at the student level) or the third model (previous test scores at
the student level and teachers’ years of experience at the classroom
level). Using only previous test scores as a predictor is the more
efficient option (Sanders, 2000; Sanders & Horn, 1995; Sanders & Rivers,
1996).

The multi-level model was a natural fit for school data, since
students are nested within classrooms and since there was enough level-two
variance (i.e., ICC). The proposed model is consistent with previous
research in that prior achievement is the strongest predictor of current
achievement. In the proposed model, the CTBS reading scale score was by
far the strongest predictor of the KCCT reading scale score. This supports
the TVAAS claim that students’ prior test scores are the best control for
determining teacher effectiveness.

Previous research on years of experience still has many unanswered 
questions. In this study, teacher experience was a valuable predictor of stu¬ 
dent learning at the classroom level, but the knowledge base shows mixed 
findings on the strength of this predictor (Munoz & Chang, 2007; Rockoff, 
2004; Rivkin, Hanushek, & Kain, 2005; Wayne & Youngs, 2003). For exam¬ 
ple, Munoz and Chang (2007) indicated that years of experience have little 
effect on student achievement in high school reading achievement gains in
a large urban district. Contrasting with the above finding, a study by Wayne 
and Youngs (2003) found that most of the studies they reviewed on student 
achievement and teacher experience yielded a positive effect of teaching ex¬ 
perience. Also, more experienced teachers are more likely to be drawn to 
higher performing schools (Cochran-Smith & Zeichner, 2005). Generaliza¬ 
tions of these findings are limited as teacher experience is often impacted by 
the subject area and the market forces. 

Stability of Teacher Effectiveness Classifications 

In terms of the consistency of “classifications,” the following
observations were made from the examination of the teacher rankings. No
teacher changed by more than two levels from one year to the next. The
largest changes in classification from one year to the next were two
levels: from well above average to average (n = 2), average to well above
average (n = 1), well below average to average (n = 1), average to well
below average (n = 2), and above average to below average (n = 1). As
already noted, only one teacher crossed over the average range (i.e., the
ranking went from above average to below average).

With the limited movement of teachers across classifications, es¬ 
pecially the restricted number of cases with movement across more than 
one classification (above, average, or below), the findings tend to suggest 
stable and trustworthy teacher effectiveness ratings over multiple years. 
In fact, of the 100 teachers included in the study across all three years, 35
teachers remained in the above average classification for all three years, 39
teachers remained in the average classification for all three years, 25
teachers remained in the below average classification, and only one teacher
crossed between the below and above average classifications over the three
years. Given this relatively
high level of consistency, it appears that a single year rating for a teacher 
is relatively stable (Sanders & Horn, 1994). When three years of data pro¬ 
duce three-year average classifications, the stability for classifying teach¬ 
ers’ effectiveness is even more trustworthy. 

Nonetheless, we do not advocate using value-added modeling as 
a single source for identifying teacher effectiveness. Rather, we strongly 
encourage using measures such as those produced in this examination as 
merely one source among multiple data sources to assess teacher effective¬ 
ness (Peterson, 2006). Moreover, we encourage due caution when interpret¬ 
ing and applying findings from value-added modeling (or, for that matter, 
any applicable data source, such as classroom observations, student surveys, 
etc.) in judging teacher effectiveness. Multiple data points are always needed.

Policy Implications

The policy implications of this study are analyzed using two semi¬ 
nal frameworks for connecting teacher evaluation to student achievement: (a) 
nine guidelines developed by James Stronge and Pamela Tucker (2000), and 
(b) four guidelines developed by Jason Millman (1997). The researchers un¬ 
derstand that there is no perfect way for evaluating policy implications for 
teacher evaluation systems like the one investigated in this study. Currently,
and regardless of the approach, teacher evaluation systems are deemed by
most teachers (and sometimes even by administrators) to be extremely
stressful and of little value.

Evaluation of proposed model using guidelines from James 
Stronge and Pamela Tucker. Stronge and Tucker (2000) provided nine 
guidelines for using testing data models as part of teacher evaluations, in¬ 
cluding (a) student learning as only one component, (b) the importance of 
context, (c) the value of growth, (d) longitudinal perspective on students, 
(e) gain scores limitations, (f) time frame, (g) fairness and validity of mea¬ 
sures, (h) alignment issues, and (i) teaching to the test issues. These guide¬ 
lines provide discussion points for evaluating the strengths and weakness¬ 
es of the current model for policy implications. 

The first guideline is to “use student learning as only one compo¬ 
nent of a teacher evaluation system that is based on multiple data sources” 
(Stronge & Tucker, 2000, p. 53). The current model uses only the KCCT 
reading score and that is problematic. In high-stakes testing environments, 
it is often easy to equate single-subject test scores to teacher effectiveness. 
This model should be interpreted as teacher effectiveness on teaching read¬ 
ing as measured by the KCCT reading test. The second guideline is “when 
judging teacher effectiveness, consider the context in which the teaching and 
learning occur” (p. 54). There are contextual issues that may inflate or de¬ 
flate student scores beyond the student’s previous attainment and the teach¬ 
er’s effectiveness. The models used for this study examined some of those 
variables (e.g., absenteeism), but did not examine many of the other contexts 
mentioned. The third guideline is to “use measures of student growth versus 
a fixed achievement standard or goal” (p. 55). This is one of the strengths 
of this model in that students’ performance is examined in terms of their 
own previous academic attainment. By using previous achievement scores, 
teachers are more likely to be judged on their own ability to increase stu¬ 
dent learning. 

The fourth guideline is to “compare learning gains from one point 
in time to another for the same students, not different groups of students” (p. 
55). This is a notable strength of the proposed model. Teachers’ effective¬ 
ness is determined by comparing students’ gains to their own prior achieve¬ 
ment, not by comparing their gains to a previous group of students’ achieve¬ 
ment. The fifth guideline is to “recognize that gain scores have pitfalls that 
must be avoided” (p. 56). Many models, including the proposed model, do
not take into account the “regression to the mean” phenomenon, which is the
likelihood that students who scored high on the previous test are more
likely to score lower on the next test, and vice versa. Another pitfall can be a
ceiling effect if the pre- and post-test measures are not sufficiently robust to 
account for growth of high ability students. This would be a problem partic¬ 
ularly if there are large numbers of high ability students grouped in the same 
classes. The sixth guideline is to “use a time frame for teacher evaluation 
that allows for patterns of student learning to be documented” (p. 56). The 
proposed model produced stronger results when examining teacher scores 
over a period of three years instead of a single year, “snapshot,” model. 
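
Regression to the mean can be stated compactly. As a hedged illustration,
assume pre- and post-test scores are standardized and related linearly with
correlation $\rho$; the symbols and the numerical values below are introduced
here and are not taken from the study.

```latex
\mathbb{E}\left[\, z_{\text{post}} \mid z_{\text{pre}} \,\right] = \rho \, z_{\text{pre}},
\qquad |\rho| < 1
% Hypothetical example: \rho = 0.7 and z_{\text{pre}} = 2.0 give an expected z_{\text{post}} of 1.4
```

Because $|\rho| < 1$, a student two standard deviations above the mean on the
pre-test is expected to score closer to the mean on the post-test, even with
no change in instruction quality.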

The seventh guideline is to “use fair and valid measures of student
learning” (p. 56). The test used for this study has undergone rigorous
scrutiny to determine validity and reliability. The eighth guideline is to
“select student assessment measures that are most closely aligned with 
existing curriculum” (p. 57). The curriculum that was being tested was 
well documented for teachers. The curriculum included content guides, 
often called content standards, which explicitly stated what content items 
the students should be taught and would be tested. The ninth guideline is 
to “not narrow the curriculum and limit teaching to fit a test” (p.58). The 
weakness of the proposed model is that it only examines the results from 
the KCCT test and, therefore, is more likely to positively recognize teach¬ 
ers that may be narrowing the curriculum to fit the KCCT test, commonly 
referred to as “teaching to the test.” 

Evaluation of proposed model using guidelines from Jason
Millman. Millman (1997) provided four criteria for use when examining
high-stakes assessment systems and their relationship to teacher
uation: (a) fairness, (b) comprehensiveness, (c) competitiveness, and (d) 
consequential validity. 

The first criterion is that of fairness. According to Millman, “the 
single most frequent criticism of any attempt to determine a teacher’s ef¬ 
fectiveness by measuring student learning is that factors beyond a teach¬ 
er’s control affect the amount students learn” (Millman, 1997, p. 244). 
The proposed model controls for some of these factors, including atten¬ 
dance, race, socio-economic status, and prior achievement. 

The second criterion is that of comprehensiveness. Millman stated 
that “this criterion refers to the degree the range of intended learning out¬ 
comes are represented in the assessment measures” (p. 245). The problem 
here is the potential loss of teaching an enriched curriculum as well as teach¬ 
ers finding themselves “faced with disincentives to follow their perceived 
best practices if doing so would detract from student performance according 
to the accountability system” (p. 245). The third criterion is that of competi¬ 
tiveness. Millman’s question is: how competitive is this method of teacher 
evaluation compared to other methods used? There is currently no known 
perfect evaluation method. Although different teacher evaluation methods 
have advantages and disadvantages, the proposed model appears to be
competitive in comparison to other methods. The fourth criterion that Millman
proposed is that of consequential validity. This criterion examines the de¬ 
sired and undesirable effects of a teacher evaluation method, even if the 
method is accurate and appropriately used. At this point, since the proposed 
model has not been used as an evaluation tool, the consequential validity has 
not been determined. Assuming the model is used, the potential consequen¬ 
tial validity could be positive if teachers view the model as their personal 
impact on student learning (i.e., adding value) and that student characteris¬ 
tics are taken into account. The potential negative side could be that teachers 
are reduced to just a number and the variables do not sufficiently account for 
all student characteristics that could impact student learning. 

Implications for Practice 

The value-added method explored in this study gives administrators 
an objective measure to be able to identify the most and least effective teach¬ 
ers as determined by the KCCT reading scale scores. By using the index 
scores (i.e., teachers rated as well above average to well below average), ad¬ 
ministrators can identify which teachers need assistance in helping their stu¬ 
dents achieve. Similarly, administrators can identify those teachers whose 
students are achieving at higher than expected rates. However, it should be 
noted that value-added models such as those tested in this study only iden¬ 
tify those teachers that are being successful or not; they do not identify the 
behaviors or instructional techniques that are making them successful. 

In an era of high stakes testing, any additional information educa¬ 
tors and policy makers can obtain to know which teachers are adding value 
and which are not is useful information that can be used to assist in devel¬ 
oping strategies to increase student success. It is recommended that until 
further validation has occurred, these scores should be used in connection 
with numerous other evaluation tools already at teacher evaluators’ dis¬ 
posal, such as teacher observations, teacher portfolios, and student work. 
For example, the proposed model could be used to assist principals or oth¬ 
er teacher evaluators in identifying low performing teachers and give them 
some assistance and guidance in how to become a better teacher. Again, 
this should be done with multiple data sources (Peterson, 2006). 

Research Limitations 

As we hope is evident, the proposed value-added model has several 
advantages, but there also are numerous limitations. Teacher effectiveness
is a complex phenomenon to define and, even more so, to measure. Selected
limitations have already been mentioned when evaluating the proposed model
by the guidelines provided by Stronge and Tucker (2000) and the criteria
provided by Millman (1997). Additionally, a primary concern about the use
of this model is that it identifies effective teachers based solely on the
students’ Kentucky KCCT
Reading scores. If one is not careful, one could easily over-simplify this re¬ 
sult and apply the label of “effective” or “ineffective” to a teacher in general. 
It is important to remember that there is more to being an effective teacher 
than just producing reading results on a point-in-time test, even if the test 
has extensive validation information to support its use. 

A second limitation is that the objective of this research was to 
identify the most, and least, effective teachers. This is not to be confused 
with determining effective, and ineffective, teaching strategies. The pro¬ 
posed model can only be used to assist in identifying teachers who are 
achieving greater, or lower, gains than expected. A third limitation is that
teachers are linked to the students they have in their classes at the time of
testing. The data do not account for high mobility students within a school or
classroom. As noted earlier, students were excluded from the data set if 
they were not enrolled within the school district for at least 100 days, but 
the data set does not take into account whether the student changes schools 
within the district. Related to this issue, each year of data was treated in¬ 
dependently and later compared on a yearly basis. Another approach could 
be to use an HLM growth model that links all three years of data (Rauden- 
bush & Bryk, 2002). 

Recommendations for Future Research 

At least four recommendations can be presented for future re¬ 
search. This study focused solely on the Kentucky Reading Test for fourth 
graders. As a result, one recommendation for future studies is to develop 
similar models to include other tested content areas as well as other tested 
grade levels. The second recommendation is to validate the teacher rank¬ 
ings by having principals complete a rating scale on teachers and then 
calculate the correlation between the principal’s ratings and the model in¬ 
dexes. In addition, trained observers could study teachers who were iden¬ 
tified as most effective and least effective by the HLM model to deter¬ 
mine productive and counter-productive instructional strategies (similar 
to the study conducted by Stronge et al., 2008). A third recommendation
is to study the impact of this kind of modeling when it is applied to Title I 
schools or schools with high teacher turnover. 

The fourth and final recommendation for future research is to ex¬ 
plore the role of teacher-to-teacher collaboration and the emergence of 
professional learning communities as related to classroom-based student 
achievement. Schools that emphasize distributive leadership and shared 
decision-making have produced notable gains in student achievement 
(Leithwood, Louis, Anderson, & Wahlstrom, 2004). Additionally, past re¬ 
search in urban elementary schools indicated that teachers who work to¬ 
gether on school improvement efforts tend to increase student achieve¬ 
ment results (Goddard, Goddard, & Tschannen-Moran, 2007). 

Conclusion

The most important school-related contributing factor to student
achievement is the quality of teaching. While traditional approaches to
teacher effectiveness have focused heavily on input and process variables,
the purpose of this study was to explore the use of HLM to determine teacher
effectiveness as measured with reading outcomes. The results identified high-,
average-, and low-performing classrooms and significant predictors of student
learning gains, and addressed the stability of teacher effectiveness ratings. The
study found limited value-added fluctuations from year to year as succeeding
cohorts of students moved through these teachers' classrooms. Although
not conclusive due to the need for replication, the instability of classroom
effectiveness indices found in this study was minimal, which indicates the
potential usefulness of value-added measures as an indicator of future performance.

Multi-level models hold substantial promise as a tool for helping
determine teacher effectiveness. Despite multiple potential drawbacks,
multi-level analysis may provide valuable information to use in personnel
decisions and teacher compensation systems. Although value-added
estimates will never measure precisely the quality of instruction in a
classroom, recognizing their methodological limitations can, to a certain
extent, facilitate more informed uses of standardized test results and the
development of stronger value-added models.

Most certainly, many questions remain unanswered. For instance,
Rothstein (2010) raised issues about statistical bias when using value-added
estimates, even when adjusting for students' prior achievement measures.
Rothstein argued that some teachers might be assigned students who differ
systematically (e.g., in motivation or parental engagement) in ways that affect their
performance, and these differences may not be captured by prior achievement
or demographic variables. Ravitch (2010) added that value-added assessment
has potential problems such as narrowing the curriculum, promoting teaching
to the test, and discarding social factors that influence student scores.

Teacher effectiveness is at an important crossroads. Perhaps more
work is needed before value-added techniques can be used to support
important decisions about teachers; in particular, we need models that allow
researchers to build on the tradition of process-product research (Good
& Brophy, 1997), which means that we need to peer inside the black box of
classrooms and not only focus on outcomes (Rowan, Correnti, & Miller,
2002). If these models are used when evaluating teachers, they should be
used along with other measures of teacher performance in our schools
(Peterson, 2006; Stronge, Ward, & Grant, 2011). These models will also have
to take into consideration the collaborative nature of teacher-to-teacher
relationships in professional learning communities and not focus on
rewarding the already archaic individuality of the teaching profession. A
better identification of effective teachers and an improved understanding
of what characterizes good teaching have significant implications for
decision-making regarding the preparation, recruitment, selection, compensation,
in-service professional development, and evaluation of teachers. No
matter the perspective on assessing teacher effectiveness, our children
deserve the best teachers, teachers who can make a difference in their
success in school, college, and life.

End Notes 

a Student records deleted due to the 100-day requirement numbered 108 (Year 1), 71
(Year 2), and 108 (Year 3).

b This classroom size requirement caused the removal of 979 (Year 1), 831 (Year 2),
and 858 (Year 3).